A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures
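Authors
Liu Qi (刘奇), Liu Yang (刘洋), Sun Maosong (孙茂松) (State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084)
Abstract
Parallel corpora are a foundational data resource that provides important support for applications such as machine translation and cross-language information retrieval. Although the number of parallel web pages on the Internet is huge and keeps growing, the heterogeneity and complexity of parallel websites mean that quickly and automatically acquiring high-quality parallel pages, and building parallel corpora from them, remains a major challenge. This paper proposes a parallel page acquisition method that combines URL patterns with HTML structures: it first uses HTML structure to access parallel pages recursively, then uses URL patterns to optimize the topological order in which a parallel website is traversed, achieving efficient and accurate acquisition of parallel pages. Experiments on two parallel websites, those of the United Nations and the Hong Kong government, show that compared with traditional acquisition methods our method reduces acquisition time by more than 50%, improves accuracy by 15%, and significantly improves machine translation quality (BLEU scores increase by 1.6 and 0.7 points, respectively).
Keywords: parallel page acquisition; parallel corpus; URL patterns; HTML structures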
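As an illustration of the URL-pattern idea mentioned in the abstract, the following is a minimal sketch of pairing candidate parallel pages by replacing a language marker in the URL path with a placeholder. It is not the authors' algorithm, which the abstract does not specify in detail; the marker lists, the `{lang}` placeholder, the function names `url_key` and `pair_urls`, and the `example.org` URLs are all assumptions made for this example.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical language markers; real sites use their own conventions.
LANG_MARKERS: Dict[str, List[str]] = {
    "en": ["en", "english"],
    "zh": ["zh", "chinese"],
}


def url_key(url: str) -> Optional[Tuple[str, str]]:
    """Return (language, language-neutral key) if the URL has a marker segment."""
    for lang, markers in LANG_MARKERS.items():
        for marker in markers:
            # Match the marker only as a whole path segment, e.g. /en/ or /chinese/.
            pattern = re.compile(
                r"(?<=/)" + re.escape(marker) + r"(?=/)", re.IGNORECASE
            )
            if pattern.search(url):
                return lang, pattern.sub("{lang}", url, count=1)
    return None


def pair_urls(urls: List[str]) -> List[Tuple[str, str]]:
    """Group URLs by language-neutral key and return (English, Chinese) pairs."""
    by_key: Dict[str, Dict[str, str]] = {}
    for url in urls:
        result = url_key(url)
        if result is None:
            continue  # no recognizable language marker in this URL
        lang, key = result
        by_key.setdefault(key, {})[lang] = url
    return [
        (group["en"], group["zh"])
        for group in by_key.values()
        if "en" in group and "zh" in group
    ]


if __name__ == "__main__":
    sample = [
        "http://example.org/en/news/1.html",
        "http://example.org/zh/news/1.html",
        "http://example.org/en/about.html",
        "http://example.org/contact.html",
    ]
    for en_url, zh_url in pair_urls(sample):
        print(en_url, "<->", zh_url)
```

Matching the marker only as a whole path segment avoids false hits inside host names or words such as "content"; a real crawler would also need site-specific patterns and a content-based check on the paired pages.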
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
XML: URL Data Set Creation for Future Web Mining Research Avenues
The rapid expansion of the internet has made the web a popular place for disseminating and collecting information, and it has also opened many research topics in various research fields. Over the last few years, several attempts have been made at Web-based research, particularly based on HTML web pages because of their greater availability. As a result, many research data sets have been created and few of them are made avai...
Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems
One of the best-known recommendation methods is user-based Collaborative Filtering (CF). It compares the active user's item ratings with the historical rating records of other users to find similar users, and recommends items that seem interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...
Parallel Sentences Mining From The Web
Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are returned by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm for extracting candidate bilingual resources and filtering out useless bilingual web pages. The pair sentences includ...
GWUM : une généralisation des pages Web guidée par les usages
The usage analysis of a Web site based on the extracted sequential patterns is often limited by the low support of these patterns. That is mainly due to the great diversity of the pages and behaviors. However, it is possible to group the majority of these pages in various categories during a preprocessing. Then, using these categories, rather than the URL, will allow us to discover "generic" be...
Publication date: 2013